Estimating the Proportion of True Null Hypotheses for Multiple Comparisons

Whole genome microarray investigations (e.g. differential expression, differential methylation, ChIP-Chip) provide opportunities to test millions of features in a genome. Traditional multiple comparison procedures such as familywise error rate (FWER) controlling procedures are too conservative. Although false discovery rate (FDR) procedures have been suggested as having greater power, the control itself is not exact and depends on the proportion of true null hypotheses. Because this proportion is unknown, it has to be accurately (small bias, small variance) estimated, preferably using a simple calculation that can be made accessible to the general scientific community. We propose an easy-to-implement method and make the R code available, for estimating the proportion of true null hypotheses. This estimate has relatively small bias and small variance as demonstrated by (simulated and real data) comparing it with four existing procedures. Although presented here in the context of microarrays, this estimate is applicable for many multiple comparison situations.


Introduction
Genomic technologies are producing vast amounts of biological data that are the basis for investigations that require repetitive testing of the same hypothesis. Because the number of tests performed (e.g. differential expression) is so large, sometimes the multiple comparison procedures that control the familywise error rate are too strict for biological applications (e.g. differential methylation). In fact, many biologists would rather experience several more false positives (i.e. type I errors; false rejections of the null hypothesis) than lose important information. In an attempt to address the multiple comparison issue Benjamini and Hochberg (1995) introduced an error rate measure called False Discovery Rate (FDR). Specifically, a family of m hypothesis tests is considered, of which m 0 are true. The proportion of erroneously rejected null hypotheses among all the rejected null hypotheses can be cap tured by the random variable Q = V/R,where R is the number of rejected hypotheses and V is the number of false rejections (type I errors). Benjamini and Hochberg (1995) formally define the FDR to be the expected proportion of falsely rejected hypotheses among all the rejections, where Q = 0 when R = 0 (no rejections). If we let p (1) Յ p (2) Յ ··· Յ p (m) be the ordered p-values and H (i) be the null hypothesis corresponding to p (i) , then in Benjamini and Hochberg's (BH) FDR controlling procedure (Benjamini and Hochberg, 1995), K is considered to be the largest k such that p (k) Յ (k/m)α, where α is the pre-chosen FDR significance level. If K exists, all null hypotheses H (i ) , i = 1, ··· ,K are rejected. If no such K exists, then no hypotheses are rejected. The BH FDR controlling procedure controls the FDR at exactly the level (m 0 /m)α Յ α, and hence conservatively at α , for independent test statistics and for any configuration of false null hypotheses (Benjamini and Yekutieli, 2001;Storey, et al. 2004). In 2000 Benjamini and Hochberg proposed an adaptive procedure which provides more power than the original FDR controlling procedure by comparing each p (k) with (k/m 0 )α where m 0 is an estimate of m 0 . If the estimated value of m 0 is such that m 0 Ն m 0 with probability one, then the adaptive BH FDR controlling procedure will lead to FDR m m m m m m = ( ) = ≤ 0 0 0 0α α α. Because the accuracy and variation of the estimate of m 0 , or π 0 = m 0 /m, directly affects the performance of the adaptive FDR controlling procedure our focus is on the estimation and effect of π 0 .
We propose a simple and easy-to-implement method for estimating the proportion of true null hypotheses. The performance of this estimate is compared to existing methods via simulated and real data. Specifically, Benjamini and Hochberg (2000) estimated the number of true hypotheses from the observed p-values using the Lowest SLope (LSL) estimator. Their approach was based on a modification of the graphical method of Schweder and Spjotvoll (1982). Alternatively, Storey (2002) proposed an estimate of π 0 by assuming the p-values corresponding to true null hypotheses are uniformly distributed on the interval (0,1) and selecting a reasonable tuning parameter 0 Յ λ Ͻ 1. Finally, Langaas, et al. (2005) derived estimators based on nonparametric maximum likelihood estimation of the p-value density, under the restriction of decreasing and convex decreasing densities. Although Benjamini and Hochberg's original and adaptive FDR controlling procedure are developed for independent statistics these procedures can also be applied to some dependence structures (Benjamini and Yekutieli, 2001). Simulations have also demonstrated that they can be used for situations where there is a weak correlation structure among the genes (Storey et al. 2004). However, because of the small number of biological replicates used in most microarray experiments, it is very diffi cult to measure the correlation structure among a set or family of genes. Reiner et al. (2003) proposed a procedure for the general case, but it is conservative when compared to the adaptive FDR controlling procedures.

Storey's approach
Our approach is motivated by the work of Storey (2002), where the proportion of true null hypotheses, π 0 , is estimated by where W(λ) = #{p i : p i Ͼ λ}, and 0 Յ λ Ͻ 1 is a tuning parameter. The rationale for this estimate is that p-values corresponding to true null hypotheses are uniformly distributed on the interval (0,1), of which most should be close to 1. Thus, for a reasonable λ, there are about m 0 (1 − λ) such p-values in the interval (λ,1] such that W(λ) ≈ m 0 (1 − λ). Black (2004) pointed out that Equation (2) is an unbiased estimate of π 0 for all values of λ if all the null hypotheses are true and the p-values have a uniform distribution on the interval (0,1). However, there is an upward bias when the p-values come from both true null and true alternative hypotheses.
As it turns out, choosing the tuning parameter λ in Equation (2) is very important since there is a biasvariance trade-off. When λ → 0, the variance of 0 ( ) π λ becomes smaller and the bias of this estimate increases. When λ → 1, the bias of 0 ( ) π λ becomes smaller, and the variance of this estimate increases. To address this point, Storey et al. (2004) proposed a bootstrap method that automatically chooses λ when estimating 0 ( ) π λ . Instead of choosing one specific λ, Storey and Tibshirani (2003) proposed an estimate of π 0 using limˆ( ) λ π λ →1 0 so that the bias is small and there is a balance between both bias and variance. For this approach, 0 ( ) π λ is plotted over a range of λ = 0,0.05,0.10,··· ,0.90 and then a natural cubic smoothing spline is fit to these data for the purpose of estimating the overall trend of 0 ( ) π λ as λ → 1. In the QVALUE (http://faculty.washington.edu/∼ jstorey/) software, the predicted value of 0 ( ) π λ at λ = 0.90 is chosen as the estimate of π 0 .

Average estimate approach
As mentioned previously, the estimate where 0 Յ λ Ͻ 1, has a large bias and small variance when λ is small and a small bias and large variance when λ is big. Suppose for each λ i ,where 0 Ͻ λ 1 Ͻλ 2 Ͻ ··· Ͻ λ n Ͻ 1, we compute 0 ( ) i π λ as in Equation (2), then  The bias of , is smaller than δ 1 (the bias of the estimate of π 0 at λ = λ 1 ) and at the same time, π 0 has a smaller variance. Considering the average of π 0 (λ) over a range of λ to estimate π 0 reduces the problem to choosing the range of λ.
is an estimate of π 0 via Equation (2) with λ = t i . The goal then becomes finding a subset of t i 's such that a new estimate of π 0 is obtained by taking the average of the corresponding values of π 0 (t i ). Let NB i denote the number of p-values which are greater than or equal to t i , and let NS i represent the number of p-values in the interval of [t i ,t i+1 ). Therefore, If the NB i p-values come from the null distribution, then on average there are Since the p-values corresponding to the true alternative hypotheses are smaller than those corresponding to the true null hypotheses, there are more p-values in the intervals [t i ,t i +1 ) with small index i. For small i, NS i is usually greater than . If such i exists, t i can be considered as the change point and we assume all the p-values bigger than t i come from the true null hypotheses. From this π 0 can be estimated by where i = min{i : In order to find the range of λ , only a lower bound of λ is required. The large values of t i are used so that it ensures the estimate of π 0 has small bias. This is equivalent to fitting a straight line with slope 0 in the right bottom part of a π 0 (t i ) versus t i plot, such that the intercept provides the estimate of π 0 . A simple modification of this approach is to estimate π 0 by taking the average of π 0 (t j ) from j = i−1 to B, that is, }. This ensures that the upward bias increases and the variance decreases, as π 0 (t i −1 ) has smaller variance and bigger bias than π 0 (t j ) for j = i, ··· , B. A remaining challenge is how to choose B. Specifically, how many λ's should be used in the interval [0,1]. Recall that a motivating factor of the proposed average estimate approach is to balance the bias and variance. The natural way to measure both the bias and variance is the meansquared error, . Since the true value of π 0 is unknown and the theoretical result is intractable, we take a bootstrap approach in the following way: 1. For each B ∈ I, I = {5, 10, 20, 50, 100}, compute π 0 (B) as in Equation (8) Notice that in step three the value of π 0 is estimated by the average of the π 0 (B) over arrange of B.

Simulation studies
To investigate the performance of the proposed average estimate approach, a simulation study was performed. Taking m = 1,000 (i.e. 1,000 genes are tested for differential expression), let π 0 vary over a wide range, say π 0 = 0.50,0.60, ··· ,0.90 which are reasonable for microarray experiments. Hypotheses, H 0 : µ = 0 versus H a : µ Ͼ 0, are tested for independent random variables Z i (i = 1, ··· ,m) from null distribution N(0,1) and alternative distribution N(2,1). Specifically, mπ 0 and m(1 −π 0 ) random variables have mean 0 and 2, respectively. For each test, the p-value is computed as where Z is a random variable from a standard normal distribution N(0,1) and z i is the observed value of Z i . For each value of π 0 , l = 1,000 data sets were simulated.
For the choice of B, B is either fixed (i.e. B = 5, 10, 20, 50, and 100) or chosen by the proposed bootstrap approach. For each of the l = 1,000 simulated data sets, when B is fixed, the estimate of π 0 is computed via Equation (8), that is, }. If such i does not exist, π 0 is estimated by the average of π 0 (t B −1 )and π 0 (t B ). For the bootstrap approach to automatically choose B, the range of B is 5, 10, 20, 50, 100.
For completion the performance of the proposed average estimate approach is compared with several existing procedures: For procedures 2 and 3, the QVALUE software (http://faculty.washington.edu/ ∼ jstorey/) was employed. For procedure 4, the R function 'convest' was downloaded from the R library 'limma' as part of the Bioconductor project (http://www. bioconductor.org). Table 1 where π 0i estimates π 0 for the i th simulation, and π 0 is the true value. As demonstrated, the LSL approach has the largest upward bias which guarantees that Benjamini and Hochberg's adaptive FDR controlling procedure controls the FDR below a prechosen FDR level. However, the FDR can be much lower than the pre-chosen FDR level. The LSL approach also has the smallest variation. The last three approaches [2-4] all underestimate the proportion of true null hypotheses. The proposed average estimate approach provides estimates of π 0 that have upward but relatively small bias and relatively small variance regardless of whether B is fixed or automatically chosen via bootstrap procedure. When B increases, the bias increases and the variation decreases. Both the small upward bias and small variance provide evidence that the proposed average estimate approach has better properties when compared to the other approaches.
The average of the true false discovery rate (FDR) from 1000 simulations is also compared in this simulation study by applying Benjamini and Hochberg's adaptive FDR controlling procedure (Benjamini and Hochberg, 2000) with π 0 estimated using the above mentioned five methods (Fig. 1). The FDR significance level was chosen as α = 0.05. For the purpose of comparison, the original BH FDR controlling procedure (Benjamini and Hochberg, 1995) and the adaptive FDR controlling procedure with the incorporation of the true value of π 0 were also applied to the p-values. It can be seen that the original BH FDR controlling procedure has the lowest FDR as expected. Because Benjamini and Hochberg's lowest slope approach overestimates π 0 , the FDR is below, but much lower than, the pre-chosen α, although this approach has a bigger FDR than that of the BH procedure. Storey's bootstrap estimate, the smoother estimate and convest estimate produce higher FDRs than the pre-chosen level because all three methods underestimate π 0 . Our proposed average estimate approach overestimates π 0 , its FDR is below but very close to the pre-chosen significance level α = 0.05. Table 1 also demonstrates that the FDR for the proposed average estimate has the relatively small variation.
The power of the five adaptive FDR controlling procedures is compared (Fig. 2). The power of a procedure is measured by average power which is defined to be the ratio of average number of correct rejections of true alternative hypotheses to the total number of true alternative hypotheses. Formally, average power = E(S)/(m − m 0 ). As illustrated, the power decreases when π 0 increases for all of the FDR controlling procedures. The original BH procedure has the lowest power, while Benjamini and Hochberg's adaptive procedure has the second lowest power. It is not surprising that Storey boot procedure has the largest statistical power, because the FDR of this procedure exceeds the pre-chosen FDR significance level (Fig. 1).

Microarray data application
The same five estimating π 0 methods were also applied to the training samples of the leukemia data of Golub et al. (1999), which consist of 27 patients with acute lymphoblastic leukemia (ALL) and 11 patients with acute myeloid leukemia (AML). The samples were assayed using Affymetrix Hgu6800 chips and the gene expression data of 7129 genes (Affymetrix probes) are available from R library golubEsets. For each gene, a simple two-sample t-test was employed for testing differential gene expression and the p-value was computed. Table 2 gives the estimate of the proportion of true null hypotheses and the number of statistically significant genes.
From this real data analysis, it can be seen that the Benjamni and Hochberg's LSL approach conservatively overestimates π 0 , hence it leads to lowest power in terms of the number of rejections. Our proposed average approach provides a slightly larger estimate than Storey's bootstrap approach, the smoother estimate and the nonparametric maximum likelihood approach (convest), even though they end up with a similar number of rejections.

Summary
As array technology improves, it is anticipated that the number of features per array will only increase, hence multiple comparisons will continue to be a challenging problem. Specific to microarrays, the false discovery rate (FDR) is preferred to familywise error rate (FWER) because the FDR controlling procedures have more statistical power than the FWER controlling procedures, even at the cost Table 1. The estimate of the proportion of true null hypotheses is compared for: Benjamini and Hochberg's lowest slope approach (LSL), Storey's π 0 (λ) estimate with λ selected via bootstrapping (Storey boot ), Storey and Tibshirani's smoother method (ST smoother ), Langass's nonparametric maximum likelihood approach (convest), and the proposed average estimate approach with fixed values of B = 5, 10, 20, 50, 100 and with B chosen via the bootstrapping procedure (B boot ). There are 1,000 simulated data sets, each with a total of m = 1, 000 hypothesis tests, for each value of π 0 . of a few more type I errors (i.e. false positives). Since Benjamini and Hochberg (1995) proposed their FDR controlling procedure, a variety of methods have been proposed to estimate π 0 , the proportion of true null hypotheses. As seen here, overestimating π 0 controls the FDR below the specified rate. When our and others, estimate of π 0 is incorporated into the Benjamini and Hochberg's FDR controlling procedure, the adaptive FDR controlling procedure has more power and an FDR close to the pre-chosen level.
In this work, we have compared several methods for estimating the proportion of true null hypotheses (π 0 ). Benjamini Hochberg's lowest slope approach (Benjamini and Hochberg, 2000) overestimates π 0 . Storey's estimate π 0 (λ) (Storey, 2002) also overestimates π 0 for any fixed value 0 Յ λ Ͻ 1. When λ → 1, the bias becomes smaller, and the variance becomes bigger. In order to find the optimal λ such that π 0 (λ) has small variation, Storey proposed a bootstrapping method (Storey et al. 2004). However, this method underestimates π 0 and the downward bias increases as the true value π 0 gets bigger. Storey and Tibshirani (2003) proposed a smoother method to estimate limˆ( ) λ π λ →1 0 such that this estimate has small bias. Unfortunately, this method also underestimates π 0 , although the bias is very small. Furthermore, the variation of this estimate is relatively large, which makes the adaptive FDR controlling procedure unstable. More recently, Langaas et al. (2005) proposed an estimate based on the nonparametric maximum likelihood function of the p-value density restricted to convex decreasing densities. However, this method also underestimates π 0 , most likely because the distribution of the p-values is not decreasing for large p-values and tends to be flat. Using the limitations of the existing approaches for estimating π 0 as the motivation, we propose the average estimate approach by taking average of the estimates of π 0 over a range of equally spaced points on the interval [0,1]. While our average estimate approach has a slightly larger bias, it also has smaller variation than any of the other methods. Furthermore, when compared to the other methods it is easy to implement (e.g. Excel) when the number of points used in approach is fixed (say, B = 10), and can be automated to choose B via a bootstrap procedure (R code available: www.stat.purdue.edu/∼doerge). When our proposed estimated value of π 0 is incorporated into Benjamini and Hochberg's adaptive FDR controlling procedure, more statistical power is gained such that the FDR can be controlled below, yet extremely close to a desired level α. Table 2. The estimate of the proportion of true null hypotheses and the number of statistically significant genes for the leukemai data (Golub et al. 1999) at significance level α = 0.05 after applying Benjamni and Hochberg's adaptive FDR controlling procedure with π 0 estimated using five methods: Benjamini and Hochberg's lowest slope approach (LSL), Storey's π 0 (λ) estimate with λ selected via bootstrapping (Storey boot ), Storey and Tibshirani's smoother method (ST smoother ), Langass's convest approach (convest), and the proposed average approach with B chosen via the bootstrapping procedure (B boot ). A two-sample t-test was used to compute the p-values.

Method
Estimate of π